# Load packages
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.0.4
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
We will work with data on blood pressure and related variables for Peruvian men who moved from rural high-altitude areas to urban lower-altitude areas. A study on a related topic that includes similar models is:
Hirschler, V., Gonzalez, C., Molinari, C., Velez, H., Nordera, M., Suarez, R., & Robredo, A. (2019). Blood pressure level increase with altitude in three Argentinean indigenous communities. AIMS Public Health, 6(4), 370. https://doi.org/10.3934/publichealth.2019.4.370
# Load the data from GitHub
peru_bp <- read_csv("https://raw.githubusercontent.com/laylaguyot/datasets/main//peru_bp.csv")
## Rows: 39 Columns: 11
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (11): Age, Years, FracLife, Weight, Height, Chin, Forearm, Calf, Pulse, ...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# View first 6 rows
head(peru_bp)
## # A tibble: 6 × 11
## Age Years FracLife Weight Height Chin Forearm Calf Pulse Systol Diastol
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 21 1 0.0476 71 1629 8 7 12.7 88 170 76
## 2 22 6 0.273 56.5 1569 3.3 5 8 64 120 60
## 3 24 5 0.208 56 1561 3.3 1.3 4.3 68 125 75
## 4 24 1 0.0417 61 1619 3.7 3 4.3 52 148 120
## 5 25 1 0.04 65 1566 9 12.7 20.7 72 140 78
## 6 27 19 0.704 62 1639 3 3.3 5.7 72 106 72
It is always a good idea to explore our data:
library(psych)
##
## Attaching package: 'psych'
## The following objects are masked from 'package:ggplot2':
##
## %+%, alpha
# A fancy scatterplot matrix
pairs.panels(peru_bp[c("Systol","Age","Years","FracLife","Weight","Height","Chin","Forearm","Calf","Pulse")],
method = "pearson", # correlation method
hist.col = "#00AFBB", # color of histogram
smooth = FALSE, density = FALSE, ellipses = FALSE)
Even though pairwise comparisons cannot show how multiple predictors jointly affect the response, they do give us a sense of the relationship between each predictor and the response, as well as of the relationships among the predictors.
We start by comparing two models: a null model (only the intercept) with a full model with all potential predictors.
# Null model
fit_start <- lm(Systol ~ 1, data = peru_bp)
# Full model
fit_full <- lm(Systol ~ Age + Years + FracLife + Weight + Height + Chin + Forearm + Calf + Pulse,
data = peru_bp)
# Forward stepwise selection
fit_step <- step(fit_start, direction = "forward", scope = formula(fit_full))
## Start: AIC=201.71
## Systol ~ 1
##
## Df Sum of Sq RSS AIC
## + Weight 1 1775.38 4756.1 191.34
## + FracLife 1 498.06 6033.4 200.62
## + Forearm 1 484.22 6047.2 200.71
## + Calf 1 410.80 6120.6 201.18
## <none> 6531.4 201.71
## + Height 1 313.58 6217.9 201.79
## + Chin 1 189.19 6342.2 202.57
## + Pulse 1 119.88 6411.6 202.99
## + Years 1 49.98 6481.5 203.41
## + Age 1 0.22 6531.2 203.71
##
## Step: AIC=191.34
## Systol ~ Weight
##
## Df Sum of Sq RSS AIC
## + FracLife 1 1314.69 3441.4 180.72
## + Years 1 972.90 3783.2 184.41
## + Age 1 385.73 4370.3 190.04
## <none> 4756.1 191.34
## + Chin 1 143.63 4612.4 192.15
## + Calf 1 16.67 4739.4 193.20
## + Pulse 1 5.31 4750.8 193.30
## + Height 1 2.01 4754.0 193.32
## + Forearm 1 1.16 4754.9 193.33
##
## Step: AIC=180.72
## Systol ~ Weight + FracLife
##
## Df Sum of Sq RSS AIC
## + Chin 1 197.372 3244.0 180.42
## <none> 3441.4 180.72
## + Years 1 113.588 3327.8 181.41
## + Age 1 100.451 3340.9 181.57
## + Forearm 1 50.548 3390.8 182.15
## + Calf 1 30.218 3411.1 182.38
## + Height 1 23.738 3417.6 182.45
## + Pulse 1 6.878 3434.5 182.64
##
## Step: AIC=180.42
## Systol ~ Weight + FracLife + Chin
##
## Df Sum of Sq RSS AIC
## <none> 3244.0 180.42
## + Age 1 132.575 3111.4 180.79
## + Height 1 113.565 3130.4 181.03
## + Years 1 101.114 3142.9 181.18
## + Pulse 1 13.001 3231.0 182.26
## + Forearm 1 0.219 3243.8 182.42
## + Calf 1 0.003 3244.0 182.42
summary(fit_step)
##
## Call:
## lm(formula = Systol ~ Weight + FracLife + Chin, data = peru_bp)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.6382 -6.6316 0.4521 6.3593 24.2086
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 52.9092 15.0895 3.506 0.001266 **
## Weight 1.4407 0.2766 5.209 8.51e-06 ***
## FracLife -27.3522 7.1185 -3.842 0.000491 ***
## Chin -1.0135 0.6945 -1.459 0.153407
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.627 on 35 degrees of freedom
## Multiple R-squared: 0.5033, Adjusted R-squared: 0.4608
## F-statistic: 11.82 on 3 and 35 DF, p-value: 1.684e-05
At each step, R considers adding each remaining predictor (+) and calculates the AIC (Akaike Information Criterion) of the resulting model. It then adds the variable that gives the greatest improvement (lowest AIC), and repeats this process until no addition lowers the AIC, at which point the model is final.
Why AIC? AIC balances fit and complexity. A lower AIC means a better trade-off between:
- Goodness of fit (how well the model explains the data)
- Complexity (how simple the model is; fewer predictors is better)
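To make this concrete, here is a small sketch comparing AIC across two nested models, using the built-in mtcars data (not the Peru data) so it runs on its own. Note that step() reports values from extractAIC(), which differ from AIC() by an additive constant but rank models the same way:

```r
# Compare AIC for two nested models (built-in mtcars data, for illustration)
fit_small <- lm(mpg ~ wt, data = mtcars)
fit_large <- lm(mpg ~ wt + hp, data = mtcars)
AIC(fit_small)
AIC(fit_large)
# The model with the lower AIC offers the better fit/complexity trade-off;
# here adding hp lowers the AIC, so forward selection would add it.
```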
We use the regsubsets() function from the leaps package to compare all possible models:
library(leaps)
# Run best subsets regression
best_models <- regsubsets(Systol ~ Age + Years + FracLife + Weight + Height + Chin + Forearm + Calf + Pulse,
data = peru_bp,
nvmax = 9, nbest = 1)
# Summarize output
summary(best_models)
## Subset selection object
## Call: regsubsets.formula(Systol ~ Age + Years + FracLife + Weight +
## Height + Chin + Forearm + Calf + Pulse, data = peru_bp, nvmax = 9,
## nbest = 1)
## 9 Variables (and intercept)
## Forced in Forced out
## Age FALSE FALSE
## Years FALSE FALSE
## FracLife FALSE FALSE
## Weight FALSE FALSE
## Height FALSE FALSE
## Chin FALSE FALSE
## Forearm FALSE FALSE
## Calf FALSE FALSE
## Pulse FALSE FALSE
## 1 subsets of each size up to 9
## Selection Algorithm: exhaustive
## Age Years FracLife Weight Height Chin Forearm Calf Pulse
## 1 ( 1 ) " " " " " " "*" " " " " " " " " " "
## 2 ( 1 ) " " " " "*" "*" " " " " " " " " " "
## 3 ( 1 ) " " " " "*" "*" " " "*" " " " " " "
## 4 ( 1 ) "*" "*" "*" "*" " " " " " " " " " "
## 5 ( 1 ) "*" "*" "*" "*" " " "*" " " " " " "
## 6 ( 1 ) "*" "*" "*" "*" " " "*" "*" " " " "
## 7 ( 1 ) "*" "*" "*" "*" "*" "*" "*" " " " "
## 8 ( 1 ) "*" "*" "*" "*" "*" "*" "*" " " "*"
## 9 ( 1 ) "*" "*" "*" "*" "*" "*" "*" "*" "*"
We can see which predictors are recommended for each model size. Now we can identify the best model based on various criteria:
# Create 2x2 grid of plots
par(mfrow = c(2, 2))
# Minimize SSE
plot(summary(best_models)$rss, xlab = "Number of predictors", ylab = "SSE", type = "l")
# Maximize Adjusted R^2
plot(summary(best_models)$adjr2, xlab = "Number of predictors", ylab = "Adjusted R²", type = "l")
# Minimize Mallows' Cp
plot(summary(best_models)$cp, xlab = "Number of predictors", ylab = "Cp", type = "l")
# Minimize BIC
plot(summary(best_models)$bic, xlab = "Number of predictors", ylab = "BIC", type = "l")
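Rather than reading the optimal size off the plots, we can extract it programmatically with which.max()/which.min(). A minimal sketch, run on the built-in mtcars data so it stands alone (not on peru_bp):

```r
library(leaps)

# Best subsets on mtcars (illustration only)
ms <- summary(regsubsets(mpg ~ ., data = mtcars, nvmax = 10))
which.max(ms$adjr2)  # number of predictors maximizing adjusted R^2
which.min(ms$cp)     # number of predictors minimizing Mallows' Cp
which.min(ms$bic)    # number of predictors minimizing BIC
```

Because BIC penalizes complexity more heavily, it typically favors a smaller model than adjusted R² does.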
How many predictors would you consider? Which ones should be included?
# Best 5 predictors
fit_5 <- lm(Systol ~ Age + Years + FracLife + Weight + Chin,
data = peru_bp)
summary(fit_5)
##
## Call:
## lm(formula = Systol ~ Age + Years + FracLife + Weight + Chin,
## data = peru_bp)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14.520 -6.640 -1.093 4.893 16.366
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 109.3590 21.4843 5.090 1.41e-05 ***
## Age -1.0120 0.3059 -3.308 0.002277 **
## Years 2.4067 0.7426 3.241 0.002723 **
## FracLife -110.8112 27.2795 -4.062 0.000282 ***
## Weight 1.0976 0.2980 3.683 0.000819 ***
## Chin -1.1918 0.6140 -1.941 0.060830 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.457 on 33 degrees of freedom
## Multiple R-squared: 0.6386, Adjusted R-squared: 0.5839
## F-statistic: 11.66 on 5 and 33 DF, p-value: 1.531e-06
Note: these methods do not account for potential issues with the model assumptions, for example. Nor do they consider possible interaction effects.
Multicollinearity occurs when two or more predictors are highly correlated with each other. This can make it difficult to estimate the individual effect of each predictor on the response.
One way to assess multicollinearity is to look at the Variance Inflation Factor (VIF). A VIF above 5 (or 10, depending on context) suggests potential multicollinearity.
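Concretely, the VIF for a given predictor is 1/(1 − R²_j), where R²_j comes from regressing that predictor on all the other predictors. A base-R sketch of this computation, shown on the built-in mtcars data (not the Peru data):

```r
# VIF by hand: regress one predictor on the others, then take 1/(1 - R^2)
r2_wt <- summary(lm(wt ~ hp + disp, data = mtcars))$r.squared
vif_wt <- 1 / (1 - r2_wt)
vif_wt  # same value car::vif() reports for wt in lm(mpg ~ wt + hp + disp)
```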
library(car)
# Calculate VIF
vif(fit_5)
# Remove Years
fit_4 <- lm(Systol ~ Age + FracLife + Weight + Chin,
data = peru_bp)
summary(fit_4)
# Recompute VIF for the reduced model
vif(fit_4)
Removing Years dropped the VIFs for the remaining
predictors, helping to address multicollinearity.
Let’s explore different datasets and address any potential issues that may come up!
We will work with data from the following study: Tang, X., Kong, D., Yan, X. (2018). Multiple Regression Analysis of a Woven Fabric Sound Absorber. Textile Research Journal. https://doi.org/10.1177/0040517518758001
# Load the data from GitHub
acoustics <- read_csv("https://raw.githubusercontent.com/laylaguyot/datasets/main//acoustics.csv")
## Rows: 24 Columns: 11
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (11): sampleID, thickness, diameter, perforation, weight, stiffness, air...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# View first 6 rows
head(acoustics)
## # A tibble: 6 × 11
## sampleID thickness diameter perforation weight stiffness airPerm acoustic0
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 0.547 0.269 6.14 247 182. 805. 0.076
## 2 2 0.541 0.378 6.08 253 74.4 888. 0.092
## 3 3 0.875 0.584 4.97 366 71.4 598. 0.129
## 4 4 0.64 0.527 4.87 319 162. 799. 0.115
## 5 5 0.75 0.534 4.51 292 156. 753. 0.095
## 6 6 0.539 0.368 4.39 232 172. 704 0.073
## # ℹ 3 more variables: acoustic1 <dbl>, acoustic2 <dbl>, acoustic3 <dbl>
Model acoustic3 using each of the following (try one at a time): air permeability, weight, perforation ratio, thickness. Check assumptions. Do any issues arise?
# Explore data
pairs.panels(acoustics[c("acoustic3","airPerm","weight","perforation","thickness")],
method = "pearson", # correlation method
hist.col = "#00AFBB", # color of histogram
smooth = FALSE, density = FALSE, ellipses = FALSE)
# Fit model with perforation
fit_model <- lm(acoustic3 ~ perforation, data = acoustics)
summary(fit_model)
##
## Call:
## lm(formula = acoustic3 ~ perforation, data = acoustics)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.059594 -0.033618 0.004946 0.027706 0.062671
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.512424 0.022057 23.232 < 2e-16 ***
## perforation -0.042188 0.006112 -6.903 6.24e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.03803 on 22 degrees of freedom
## Multiple R-squared: 0.6841, Adjusted R-squared: 0.6698
## F-statistic: 47.65 on 1 and 22 DF, p-value: 6.238e-07
plot(fit_model)
# Try a transformation
acoustics <- acoustics |>
  mutate(perforation_inv = 1/perforation)
# Fit model
fit_model <- lm(acoustic3 ~ perforation_inv, data = acoustics)
summary(fit_model)
##
## Call:
## lm(formula = acoustic3 ~ perforation_inv, data = acoustics)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.104264 -0.021164 -0.002658 0.029929 0.081648
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.25844 0.02209 11.697 6.49e-11 ***
## perforation_inv 0.32165 0.05829 5.518 1.52e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.04382 on 22 degrees of freedom
## Multiple R-squared: 0.5805, Adjusted R-squared: 0.5615
## F-statistic: 30.45 on 1 and 22 DF, p-value: 1.519e-05
plot(fit_model)
We will work with data from the following study: Şimşek, A. (2007). The use of 3D-nonlinear regression analysis in mathematics modeling of colour change in roasted hazelnuts. Journal of Food Engineering, 78(4), 1361-1370. https://doi.org/10.1016/j.jfoodeng.2006.01.008
# Load the data from GitHub
hazelnuts <- read_csv("https://raw.githubusercontent.com/laylaguyot/datasets/main//hazelnuts.csv")
## Rows: 150 Columns: 6
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (2): type, measure
## dbl (4): experiment, temperature, time, color_change
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# View first 6 rows
head(hazelnuts)
## # A tibble: 6 × 6
## experiment temperature time color_change type measure
## <dbl> <dbl> <dbl> <dbl> <chr> <chr>
## 1 1 125 15 81.0 fosa whole
## 2 2 125 25 76.8 fosa whole
## 3 3 125 35 76.3 fosa whole
## 4 4 125 45 74.6 fosa whole
## 5 5 125 55 73.6 fosa whole
## 6 6 135 13 79.3 fosa whole
Model color_change based on temperature and time. Check assumptions. Do any issues arise?
# Explore data
pairs.panels(hazelnuts[c("color_change", "temperature","time")],
method = "pearson", # correlation method
hist.col = "#00AFBB", # color of histogram
smooth = FALSE, density = FALSE, ellipses = FALSE)
# Fit model
fit_model <- lm(color_change ~ temperature + time, data = hazelnuts)
# Summary
summary(fit_model)
##
## Call:
## lm(formula = color_change ~ temperature + time, data = hazelnuts)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.7564 -2.3274 0.2477 2.3283 6.0409
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 100.52162 2.85072 35.262 < 2e-16 ***
## temperature -0.14738 0.01772 -8.316 5.64e-14 ***
## time -0.24155 0.01995 -12.105 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.806 on 147 degrees of freedom
## Multiple R-squared: 0.5219, Adjusted R-squared: 0.5154
## F-statistic: 80.22 on 2 and 147 DF, p-value: < 2.2e-16
# Check assumptions
plot(fit_model)
Let’s consider a polynomial regression:
# Center the predictors and create squared and interaction terms
hazelnuts <- hazelnuts |>
mutate(temp_c = temperature - mean(temperature),
time_c = time - mean(time),
temp2 = temp_c^2,
time2 = time_c^2,
temp_c.time_c = temp_c*time_c)
# Fit the polynomial regression model
fit_model <- lm(color_change ~ temp_c + temp2 + time_c + time2 + temp_c.time_c,
data = hazelnuts)
# Display the summary table for the regression model
summary(fit_model)
##
## Call:
## lm(formula = color_change ~ temp_c + temp2 + time_c + time2 +
## temp_c.time_c, data = hazelnuts)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.0395 -2.1019 0.0647 2.1811 4.9496
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 72.2383565 0.3949027 182.927 < 2e-16 ***
## temp_c -0.1693516 0.0171207 -9.892 < 2e-16 ***
## temp2 -0.0018928 0.0013632 -1.389 0.167
## time_c -0.2885604 0.0194290 -14.852 < 2e-16 ***
## time2 -0.0007211 0.0016982 -0.425 0.672
## temp_c.time_c -0.0091940 0.0018166 -5.061 1.25e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.502 on 144 degrees of freedom
## Multiple R-squared: 0.6277, Adjusted R-squared: 0.6148
## F-statistic: 48.56 on 5 and 144 DF, p-value: < 2.2e-16
# Check assumptions
plot(fit_model)
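As an aside, the same kind of quadratic-with-interaction model can be written without creating helper columns, using I() inside the formula. A sketch on the built-in mtcars data (hp_c and wt_c are centered versions of hp and wt, introduced here for illustration):

```r
# Quadratic terms and an interaction via I(), after centering the predictors
d <- transform(mtcars,
               hp_c = hp - mean(hp),
               wt_c = wt - mean(wt))
fit_poly <- lm(mpg ~ hp_c + I(hp_c^2) + wt_c + I(wt_c^2) + hp_c:wt_c, data = d)
coef(fit_poly)  # intercept plus five slope terms
```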
We will work with data from the following study: Sherwood, J., Inouye, C., Webb, S., Zhou, A., Anderson, E., & Spink, N. (2019). Relationship between physical and cognitive performance in community-dwelling, ethnically diverse older adults: a cross-sectional study. https://doi.org/10.7717/peerj.6159
# Load the data from GitHub
cognition <- read_csv("https://raw.githubusercontent.com/laylaguyot/datasets/main//cognition.csv")
## Rows: 87 Columns: 9
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (2): Gender, Education
## dbl (7): ID, Age, Heart_Rate, MMS, Animal_Naming, TMTA, TMTB
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# View first 6 rows
head(cognition)
## # A tibble: 6 × 9
## ID Gender Age Education Heart_Rate MMS Animal_Naming TMTA TMTB
## <dbl> <chr> <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 M 91 Graduate 109 96 20 34.8 154.
## 2 2 F 83 Undergraduate 112 98 18 45 157
## 3 3 F 91 Undergraduate 99 78 10 30.5 122.
## 4 4 F 76 Graduate 105 94 16 49 58
## 5 5 F 63 Highschool 88 94 21 26.6 77
## 6 6 F 75 Undergraduate 109 94 19 40.9 128
Model TMTB based on Heart_Rate and Education. Check assumptions. Do any issues arise?
# Explore data
pairs.panels(cognition[c("TMTB", "Heart_Rate","Education")],
method = "pearson", # correlation method
hist.col = "#00AFBB", # color of histogram
smooth = FALSE, density = FALSE, ellipses = FALSE)
# Fit model
fit_model <- lm(TMTB ~ Heart_Rate + Education, data = cognition)
summary(fit_model)
##
## Call:
## lm(formula = TMTB ~ Heart_Rate + Education, data = cognition)
##
## Residuals:
## Min 1Q Median 3Q Max
## -80.651 -29.717 -8.452 17.209 169.778
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 151.0546 34.7086 4.352 3.82e-05 ***
## Heart_Rate -0.6003 0.2761 -2.174 0.032546 *
## EducationHighschool 52.0220 13.4825 3.858 0.000225 ***
## EducationUndergraduate 11.7843 12.6032 0.935 0.352488
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 47.36 on 83 degrees of freedom
## Multiple R-squared: 0.2721, Adjusted R-squared: 0.2458
## F-statistic: 10.34 on 3 and 83 DF, p-value: 7.43e-06
plot(fit_model)
# Try two transformations
cognition <- cognition |>
  mutate(TMTB_log = log10(TMTB),
         TMTB_sqrt = sqrt(TMTB))
# Fit model with log
fit_model <- lm(TMTB_log ~ Heart_Rate + Education, data = cognition)
summary(fit_model)
##
## Call:
## lm(formula = TMTB_log ~ Heart_Rate + Education, data = cognition)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.37289 -0.14213 -0.00944 0.11190 0.52441
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.210853 0.133966 16.503 < 2e-16 ***
## Heart_Rate -0.002939 0.001066 -2.758 0.007150 **
## EducationHighschool 0.201953 0.052039 3.881 0.000208 ***
## EducationUndergraduate 0.049082 0.048645 1.009 0.315922
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1828 on 83 degrees of freedom
## Multiple R-squared: 0.3066, Adjusted R-squared: 0.2815
## F-statistic: 12.23 on 3 and 83 DF, p-value: 1.051e-06
plot(fit_model)
# Fit model with sqrt
fit_model <- lm(TMTB_sqrt ~ Heart_Rate + Education, data = cognition)
summary(fit_model)
##
## Call:
## lm(formula = TMTB_sqrt ~ Heart_Rate + Education, data = cognition)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.7453 -1.4760 -0.1839 1.0811 7.0060
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 12.44816 1.56914 7.933 8.84e-12 ***
## Heart_Rate -0.03140 0.01248 -2.515 0.013815 *
## EducationHighschool 2.39838 0.60953 3.935 0.000172 ***
## EducationUndergraduate 0.57306 0.56978 1.006 0.317452
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.141 on 83 degrees of freedom
## Multiple R-squared: 0.2964, Adjusted R-squared: 0.271
## F-statistic: 11.65 on 3 and 83 DF, p-value: 1.893e-06
plot(fit_model)
The log transformation best addressed the issues with the model assumptions.
We will work with data from the following study: Zendle, D. (2020). Beyond loot boxes: a variety of gambling-like practices in video games are linked to both problem gambling and disordered gaming. https://doi.org/10.7717/peerj.9466
# Upload the data from GitHub
gaming <- read_csv("https://raw.githubusercontent.com/laylaguyot/datasets/main//gaming.csv")
## Rows: 1108 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): mainGame
## dbl (12): age, gender, ethnicity, gameplayFreq, gambling, esports, token, si...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# View first 6 rows
head(gaming)
## # A tibble: 6 × 13
## age gender ethnicity gameplayFreq gambling esports token simulated
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 38 0 1 1 11 0 0 0
## 2 50 0 0 1 6 0 0 0
## 3 32 0 0 1 6 0 0 0
## 4 38 0 0 1 9 0 0 1
## 5 60 0 0 1 0 0 0 0
## 6 58 0 0 1 3 0 0 0
## # ℹ 5 more variables: real_money <dbl>, lootbox <dbl>, problem_gambling <dbl>,
## # gaming_disorder <dbl>, mainGame <chr>
Some reasonable outcomes to model would be indicators of problem_gambling or gaming_disorder. Because an indicator is a binary outcome, we should fit a logistic regression model rather than a linear one.
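A minimal sketch of such a model with glm(), shown on the built-in mtcars data (where am is already a 0/1 indicator) rather than the gaming data:

```r
# Logistic regression: a binary outcome modeled with family = binomial
fit_logit <- glm(am ~ wt + hp, data = mtcars, family = binomial)
summary(fit_logit)
# Coefficients are on the log-odds scale; exponentiate for odds ratios
exp(coef(fit_logit))
```

With the gaming data, the binary indicator would take the place of am, and predictors such as gameplayFreq or lootbox could take the place of wt and hp.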